This is a much shorter and less philosophical lesson than the previous ones, but it should be very useful when considering how to pull data into R.
Base functions for reading and writing files
Reading files
R has solid built-in functions for importing data from files with the read.table() family of functions. read.table() is the generic form: at a minimum it expects a filename (in quotes) and, importantly, an indication of the separator character used - it defaults to "", which indicates white space (one or more spaces, tabs, newlines, or carriage returns). The default header parameter for read.table() is FALSE, meaning that the function will not use the first row to determine column names. Because non-Excel tabular files are generally comma-delimited or tab-delimited with a first-row header, read.csv() and read.delim() are the go-to base file reading functions: both default to header = TRUE and use comma and tab delimiting, respectively.
There are a variety of other useful parameters to consider, including explicitly supplying the column names via the col.names parameter (if not defined in the header, for example). One related group of parameters to be conscious of with these functions is stringsAsFactors and colClasses. When R reads a file, it converts each column to a specific data type based on the content within that column. The default behavior of R is to convert columns with non-numeric data into a factor, which is a representation of categorical variables. For example, you may want to separate out data by sex (M/F) or between three instruments A, B, and C, and it makes perfect sense to represent these as a factor, so that you can easily stratify the groups during analyses in R, particularly for modeling questions. So, by default, these base functions use stringsAsFactors = TRUE, which means that any columns with characters may not have the expected behavior when you analyze the data. In general this may not be a big deal, but it can cause problems in a couple of scenarios: 1. You are expecting a column to be a string so you can parse the data (using the stringr package, for example). Not a huge deal - you can convert it to a character. 2. There are typos or other data irregularities that cause R to interpret the column as a character and then automatically convert it to a factor. If you are not careful and attempt to convert this column back to a numeric type (using as.numeric(), for example), you can end up converting the column to a completely different set of numbers! That is because factors are represented as integers within R, and using a function like as.numeric() will convert each value to its backend factor integer representation. So c(20, 4, 32, 5) could become c(1, 2, 3, 4) and you may not realize it.
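A minimal sketch of this trap, with made-up values:

```r
x <- factor(c(20, 4, 32, 5))
as.numeric(x)                # gives the internal level codes, not 20 4 32 5
as.numeric(as.character(x))  # safe round trip: converts to character first, then numeric
```

The second form recovers the original values because as.character() returns the factor labels rather than the underlying integer codes.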
Problem #2 will come back to haunt you if you are not careful. The brute force defense mechanism is to escape the default behavior: read.csv("file_name.csv", stringsAsFactors = FALSE). This will prevent R from converting any columns with characters into factors. However, you may want some of your columns to be represented as factors. You can modify behavior on a column-by-column basis: read.csv("file_name.csv", colClasses = c("character", "factor", "integer")) will set a 3-column csv file to character, factor, and integer data types in that column order.
To be safe, the best practice is arguably to explicitly define column types when you read in a file. It is a little extra work up front but can save you some pain later on.
For the curious, additional information about the history of stringsAsFactors can be found here.
Exercise 1
Let’s run through the base reading function with a csv.
- Use the base
read.csv() function to read the “2017-01-06_s.csv” file in the data folder into a data frame.
base_load <- read.csv("data/2017-01-06_s.csv")
- What is the internal structure of the object? (Recall the
str() command to quickly view the structure.)
str(base_load)
## 'data.frame': 187200 obs. of 10 variables:
## $ batchName : Factor w/ 600 levels "b100302","b101197",..: 540 540 540 540 540 540 540 540 540 540 ...
## $ sampleName : Factor w/ 24128 levels "s000001","s000002",..: 6605 6605 6605 6605 6605 6605 6606 6606 6606 6606 ...
## $ compoundName : Factor w/ 6 levels "codeine","hydrocodone",..: 4 3 6 1 2 5 4 3 6 1 ...
## $ ionRatio : num 0 0 0 0 0 0 0 0 0 0 ...
## $ response : num 0 0 0 0 0 0 0 0 0 0 ...
## $ concentration : num 0 0 0 0 0 0 0 0 0 0 ...
## $ sampleType : Factor w/ 4 levels "blank","qc","standard",..: 1 1 1 1 1 1 3 3 3 3 ...
## $ expectedConcentration: int 0 0 0 0 0 0 0 0 0 0 ...
## $ usedForCurve : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ samplePassed : logi FALSE TRUE TRUE TRUE TRUE TRUE ...
- Summarize the data. (Recall the
summary() function to view column types and characteristics about the data.)
summary(base_load)
## batchName sampleName compoundName ionRatio
## b100302: 312 s035001: 24 codeine :31200 Min. :0.0000
## b101197: 312 s035002: 24 hydrocodone :31200 1st Qu.:0.0000
## b101972: 312 s035003: 24 hydromorphone:31200 Median :0.8537
## b102100: 312 s035004: 24 morphine :31200 Mean :0.6689
## b102508: 312 s035005: 24 oxycodone :31200 3rd Qu.:1.2479
## b103050: 312 s035006: 24 oxymorphone :31200 Max. :1.7199
## (Other):185328 (Other):187056
## response concentration sampleType expectedConcentration
## Min. :0.0000 Min. : 0.00 blank : 7200 Min. : 0.00
## 1st Qu.:0.0000 1st Qu.: 0.00 qc : 10800 1st Qu.: 0.00
## Median :0.3128 Median : 44.24 standard: 25200 Median : 0.00
## Mean :0.9647 Mean :135.83 unknown :144000 Mean : 35.77
## 3rd Qu.:1.8650 3rd Qu.:264.20 3rd Qu.: 0.00
## Max. :5.9517 Max. :857.16 Max. :500.00
##
## usedForCurve samplePassed
## Mode :logical Mode :logical
## FALSE:162803 FALSE:4678
## TRUE :24397 TRUE :182522
##
##
##
##
- Repeat the previous steps starting with #2, but include the argument
stringsAsFactors = FALSE when you read in the data.
base_load_nofactors <- read.csv("data/2017-01-06_s.csv",
stringsAsFactors = FALSE)
str(base_load_nofactors)
## 'data.frame': 187200 obs. of 10 variables:
## $ batchName : chr "b802253" "b802253" "b802253" "b802253" ...
## $ sampleName : chr "s253001" "s253001" "s253001" "s253001" ...
## $ compoundName : chr "morphine" "hydromorphone" "oxymorphone" "codeine" ...
## $ ionRatio : num 0 0 0 0 0 0 0 0 0 0 ...
## $ response : num 0 0 0 0 0 0 0 0 0 0 ...
## $ concentration : num 0 0 0 0 0 0 0 0 0 0 ...
## $ sampleType : chr "blank" "blank" "blank" "blank" ...
## $ expectedConcentration: int 0 0 0 0 0 0 0 0 0 0 ...
## $ usedForCurve : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ samplePassed : logi FALSE TRUE TRUE TRUE TRUE TRUE ...
summary(base_load_nofactors)
## batchName sampleName compoundName ionRatio
## Length:187200 Length:187200 Length:187200 Min. :0.0000
## Class :character Class :character Class :character 1st Qu.:0.0000
## Mode :character Mode :character Mode :character Median :0.8537
## Mean :0.6689
## 3rd Qu.:1.2479
## Max. :1.7199
## response concentration sampleType
## Min. :0.0000 Min. : 0.00 Length:187200
## 1st Qu.:0.0000 1st Qu.: 0.00 Class :character
## Median :0.3128 Median : 44.24 Mode :character
## Mean :0.9647 Mean :135.83
## 3rd Qu.:1.8650 3rd Qu.:264.20
## Max. :5.9517 Max. :857.16
## expectedConcentration usedForCurve samplePassed
## Min. : 0.00 Mode :logical Mode :logical
## 1st Qu.: 0.00 FALSE:162803 FALSE:4678
## Median : 0.00 TRUE :24397 TRUE :182522
## Mean : 35.77
## 3rd Qu.: 0.00
## Max. :500.00
- For this data set, which fields should be strings and which should be factors?
End Exercise
Writing files
The functions for reading files in base R have equivalents for writing files as well: write.table() and write.csv(). The first argument in these functions is the data frame or matrix to be written and the second argument is the file name (in quotes).
write.table(x, file = "", append = FALSE, quote = TRUE, sep = " ",
eol = "\n", na = "NA", dec = ".", row.names = TRUE,
col.names = TRUE, qmethod = c("escape", "double"),
fileEncoding = "")
There are a few other important parameters:
- sep indicates the field separator ("\t" for tab)
- row.names is set to TRUE by default - in general this makes for an ugly output file because the first column shows the row number (I almost always set this to FALSE when I use the base function)
- na indicates the string to use for missing data and is set to R's standard of "NA" by default
- append can be set to TRUE if you would like to append your data frame/matrix to an existing file
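Putting those parameters together, here is a small illustrative example (the data frame and file name are made up):

```r
results <- data.frame(sample = c("s01", "s02"),
                      concentration = c(12.5, NA))
# row.names = FALSE avoids the extra first column of row numbers;
# na = "" writes missing values as empty cells instead of "NA"
write.csv(results, file = "results.csv", row.names = FALSE, na = "")
```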
Speeding things up with the readr package
Base R functions get the job done, but they have some weaknesses:
- they are relatively slow for reading large files (compared to newer packages such as readr or data.table)
- the automatic conversion of strings to factors by default can be annoying to turn off
- output with row names by default can be annoying to turn off
One package in the tidyverse family meant to address these issues is readr. This package provides functions similar to the base R file reading functions, with very similar function names: read_csv() (instead of read.csv()) or read_delim() for example. Tab-delimited files can be read in with read_tsv(). These functions are ~10x faster at reading in files than the base R functions and do not automatically convert strings to factors. Readr functions also provide a helpful syntax for explicitly defining column types:
# purely a dummy example, not executable!
imaginary_data_frame <- read_csv(
"imaginary_file.csv",
col_types = cols(
x = col_integer(),
y = col_character(),
z = col_datetime()
)
)
Another advantage of these functions is that they actually explicitly tell you how the columns were parsed when you import (as we’ll see in the exercise).
Readr also offers equivalent write functions such as write_csv() and write_tsv(). There is a variant of write_csv() specifically for csv files intended to be read with Excel: write_excel_csv(). These functions do not write row names by default.
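A quick sketch of these write functions (data frame and file names are illustrative):

```r
library(readr)

results <- data.frame(sample = c("s01", "s02"), passed = c(TRUE, FALSE))
write_csv(results, "results.csv")              # comma-delimited, no row names
write_tsv(results, "results.tsv")              # tab-delimited
write_excel_csv(results, "results_excel.csv")  # adds a UTF-8 byte order mark so Excel
                                               # opens the file with correct encoding
```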
Exercise 2
Now let's run through using the readr function for a csv:
- Use the read_csv() function to read the "2017-01-06_b.csv" file into a data frame.
readr_load <- read_csv("data/2017-01-06_b.csv")
## Parsed with column specification:
## cols(
## batchName = col_character(),
## instrumentName = col_character(),
## compoundName = col_character(),
## calibrationSlope = col_double(),
## calibrationIntercept = col_double(),
## calibrationR2 = col_double(),
## batchPassed = col_logical(),
## reviewerName = col_character(),
## batchCollectedTimestamp = col_datetime(format = ""),
## reviewStartTimestamp = col_datetime(format = ""),
## reviewCompleteTimestamp = col_datetime(format = "")
## )
- What is the internal structure of the object?
str(readr_load)
## Classes 'tbl_df', 'tbl' and 'data.frame': 3600 obs. of 11 variables:
## $ batchName : chr "b802253" "b802253" "b802253" "b802253" ...
## $ instrumentName : chr "doc" "doc" "doc" "doc" ...
## $ compoundName : chr "morphine" "hydromorphone" "oxymorphone" "codeine" ...
## $ calibrationSlope : num 0.00775 0.00768 0.00798 0.00819 0.00656 ...
## $ calibrationIntercept : num -5.76e-06 -4.38e-05 5.15e-06 -5.02e-06 4.56e-05 1.26e-05 -7.96e-05 4.00e-05 -2.48e-05 -3.95e-05 ...
## $ calibrationR2 : num 0.997 0.981 0.984 0.981 0.991 ...
## $ batchPassed : logi TRUE TRUE TRUE TRUE TRUE TRUE ...
## $ reviewerName : chr "Xavier" "Xavier" "Xavier" "Xavier" ...
## $ batchCollectedTimestamp: POSIXct, format: "2017-01-06 21:40:00" "2017-01-06 21:40:00" ...
## $ reviewStartTimestamp : POSIXct, format: "2017-01-07 13:43:00" "2017-01-07 13:43:00" ...
## $ reviewCompleteTimestamp: POSIXct, format: "2017-01-07 14:09:00" "2017-01-07 14:09:00" ...
## - attr(*, "spec")=List of 2
## ..$ cols :List of 11
## .. ..$ batchName : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ instrumentName : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ compoundName : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ calibrationSlope : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ calibrationIntercept : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ calibrationR2 : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ batchPassed : list()
## .. .. ..- attr(*, "class")= chr "collector_logical" "collector"
## .. ..$ reviewerName : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ batchCollectedTimestamp:List of 1
## .. .. ..$ format: chr ""
## .. .. ..- attr(*, "class")= chr "collector_datetime" "collector"
## .. ..$ reviewStartTimestamp :List of 1
## .. .. ..$ format: chr ""
## .. .. ..- attr(*, "class")= chr "collector_datetime" "collector"
## .. ..$ reviewCompleteTimestamp:List of 1
## .. .. ..$ format: chr ""
## .. .. ..- attr(*, "class")= chr "collector_datetime" "collector"
## ..$ default: list()
## .. ..- attr(*, "class")= chr "collector_guess" "collector"
## ..- attr(*, "class")= chr "col_spec"
- Summarize the data.
summary(readr_load)
## batchName instrumentName compoundName
## Length:3600 Length:3600 Length:3600
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## calibrationSlope calibrationIntercept calibrationR2 batchPassed
## Min. :0.003172 Min. :-9.510e-05 Min. :0.9800 Mode:logical
## 1st Qu.:0.006794 1st Qu.:-2.160e-05 1st Qu.:0.9860 TRUE:3600
## Median :0.007060 Median : 6.965e-08 Median :0.9902
## Mean :0.007107 Mean : 7.202e-08 Mean :0.9899
## 3rd Qu.:0.007351 3rd Qu.: 2.180e-05 3rd Qu.:0.9938
## Max. :0.009626 Max. : 1.082e-04 Max. :1.0000
## reviewerName batchCollectedTimestamp
## Length:3600 Min. :2017-01-06 20:08:00
## Class :character 1st Qu.:2017-01-13 22:55:15
## Mode :character Median :2017-01-21 11:00:30
## Mean :2017-01-21 10:58:03
## 3rd Qu.:2017-01-28 23:24:30
## Max. :2017-02-05 01:54:00
## reviewStartTimestamp reviewCompleteTimestamp
## Min. :2017-01-07 09:08:00 Min. :2017-01-07 09:35:00
## 1st Qu.:2017-01-14 12:05:45 1st Qu.:2017-01-14 12:24:15
## Median :2017-01-21 23:18:30 Median :2017-01-21 23:41:30
## Mean :2017-01-21 23:24:24 Mean :2017-01-21 23:54:28
## 3rd Qu.:2017-01-29 11:17:15 3rd Qu.:2017-01-29 11:55:00
## Max. :2017-02-05 13:49:00 Max. :2017-02-05 14:15:00
- Finally, let's follow some best practices and explicitly define columns with the
col_types argument. We want to explicitly define compoundName and sampleType as factors. Note that col_factor() expects a definition of the factor levels, but you can get around this by supplying NULL. Then run a summary to review the data. (Because this batch-level file does not actually contain a sampleType column, readr warns that the named parser doesn't match any column name.)
readr_load_factors <- read_csv("data/2017-01-06_b.csv",
col_types = cols(
compoundName = col_factor(NULL),
sampleType = col_factor(NULL)
)
)
## Warning: The following named parsers don't match the column names:
## sampleType
summary(readr_load_factors)
## batchName instrumentName compoundName
## Length:3600 Length:3600 morphine :600
## Class :character Class :character hydromorphone:600
## Mode :character Mode :character oxymorphone :600
## codeine :600
## hydrocodone :600
## oxycodone :600
## calibrationSlope calibrationIntercept calibrationR2 batchPassed
## Min. :0.003172 Min. :-9.510e-05 Min. :0.9800 Mode:logical
## 1st Qu.:0.006794 1st Qu.:-2.160e-05 1st Qu.:0.9860 TRUE:3600
## Median :0.007060 Median : 6.965e-08 Median :0.9902
## Mean :0.007107 Mean : 7.202e-08 Mean :0.9899
## 3rd Qu.:0.007351 3rd Qu.: 2.180e-05 3rd Qu.:0.9938
## Max. :0.009626 Max. : 1.082e-04 Max. :1.0000
## reviewerName batchCollectedTimestamp
## Length:3600 Min. :2017-01-06 20:08:00
## Class :character 1st Qu.:2017-01-13 22:55:15
## Mode :character Median :2017-01-21 11:00:30
## Mean :2017-01-21 10:58:03
## 3rd Qu.:2017-01-28 23:24:30
## Max. :2017-02-05 01:54:00
## reviewStartTimestamp reviewCompleteTimestamp
## Min. :2017-01-07 09:08:00 Min. :2017-01-07 09:35:00
## 1st Qu.:2017-01-14 12:05:45 1st Qu.:2017-01-14 12:24:15
## Median :2017-01-21 23:18:30 Median :2017-01-21 23:41:30
## Mean :2017-01-21 23:24:24 Mean :2017-01-21 23:54:28
## 3rd Qu.:2017-01-29 11:17:15 3rd Qu.:2017-01-29 11:55:00
## Max. :2017-02-05 13:49:00 Max. :2017-02-05 14:15:00
End Exercise
For reference we can compare the time required to run the base read.csv() function with the readr read_csv() function using system.time().
Time to read with base:
system.time(base_load <- read.csv("data/2017-01-06_p.csv"))
Time to read with readr:
system.time(readr_load <- read_csv("data/2017-01-06_p.csv"))
Dealing with Excel files (gracefully)
You may have broken up with Excel, but unfortunately many of your colleagues have not. You may be using a little Excel on the side. (Don’t worry, we don’t judge!) So Excel files will continue to be a part of your life. The readxl package makes it easy to read in data from these files and also offers additional useful functionality. As with the other file reading functions, the syntax is pretty straightforward: read_excel("file_name.xlsx"). Excel files have an added layer of complexity in that one file may have multiple worksheets, so the sheet = "worksheet_name" argument can be added to specify the desired worksheet. Different portions of the spreadsheet can be read using the range argument. For example, a subset of rows and columns can be selected via cell coordinates: read_excel("file_name.xlsx", range = "B1:D6") or read_excel("file_name.xlsx", range = cell_cols("A:F")).
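If you don't know the worksheet names up front, readxl's excel_sheets() function lists them, so you can pick one programmatically (the file name here is illustrative):

```r
library(readxl)

sheets <- excel_sheets("file_name.xlsx")               # character vector of worksheet names
first_sheet <- read_excel("file_name.xlsx",
                          sheet = sheets[1])           # read the first worksheet by name
```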
If you are dealing with Excel data that is not a traditional tabular format, the tidyxl package is useful to be aware of. We will not cover it in this course but it is worth reading up on if you ever have to analyze a pivot table or some other product of an Excel analysis.
Exercise 3
You might be able to guess what comes next: we'll read in an Excel file.
1. Use the read_excel() function to read the "orders_data_set.xlsx" file into a data frame
2. View a summary of the imported data
3. Now read in only the first 5 columns using the range parameter
4. Review the first 6 lines of the imported data
readxl_load <- read_excel("data/orders_data_set.xlsx")
summary(readxl_load)
## Order ID Patient ID Description Proc Code
## Min. : 10002 Min. :500001 Length:45002 Length:45002
## 1st Qu.: 32669 1st Qu.:503350 Class :character Class :character
## Median : 55246 Median :506862 Mode :character Mode :character
## Mean : 55133 Mean :506897
## 3rd Qu.: 77627 3rd Qu.:510421
## Max. :100000 Max. :513993
##
## ORDER_CLASS_C_DESCR LAB_STATUS_C LAB_STATUS_C_DESCR ORDER_STATUS_C
## Length:45002 Min. :1.000 Length:45002 Min. :2.000
## Class :character 1st Qu.:3.000 Class :character 1st Qu.:5.000
## Mode :character Median :3.000 Mode :character Median :5.000
## Mean :3.061 Mean :4.783
## 3rd Qu.:3.000 3rd Qu.:5.000
## Max. :5.000 Max. :5.000
## NA's :7152 NA's :18
## ORDER_STATUS_C_DESCR REASON_FOR_CANC_C REASON_FOR_CANC_C_DESCR
## Length:45002 Min. : 1.0 Length:45002
## Class :character 1st Qu.: 11.0 Class :character
## Mode :character Median : 11.0 Mode :character
## Mean : 437.2
## 3rd Qu.:1178.0
## Max. :1178.0
## NA's :37794
## Order Time Result Time
## Min. :2017-08-13 11:59:00 Min. :2017-06-15 00:00:00
## 1st Qu.:2017-09-05 11:16:00 1st Qu.:2017-09-07 12:51:00
## Median :2017-09-27 08:48:00 Median :2017-09-29 14:06:30
## Mean :2017-09-27 09:39:30 Mean :2017-09-30 17:11:17
## 3rd Qu.:2017-10-19 13:45:00 3rd Qu.:2017-10-23 18:39:30
## Max. :2017-11-11 19:49:00 Max. :2017-12-29 07:37:00
## NA's :7152
## Review Time Department
## Min. :2017-08-15 09:16:00 Length:45002
## 1st Qu.:2017-09-15 23:32:30 Class :character
## Median :2017-10-12 14:22:00 Mode :character
## Mean :2017-10-11 06:52:47
## 3rd Qu.:2017-11-02 09:39:00
## Max. :2017-12-29 22:24:00
## NA's :7791
readxl_load_subset <- read_excel("data/orders_data_set.xlsx", range = cell_cols("A:E"))
head(readxl_load_subset)
## # A tibble: 6 x 5
## `Order ID` `Patient ID` Description `Proc Code` ORDER_CLASS_C_DE…
## <dbl> <dbl> <chr> <chr> <chr>
## 1 19766 511388 PROTHROMBIN TIME PRO Normal
## 2 88444 511388 BASIC METABOLIC P… BMP Normal
## 3 40477 508061 THYROID STIMULATI… TSH Normal
## 4 97641 508061 T4, FREE T4FR Normal
## 5 99868 505646 COMPREHENSIVE MET… COMP Normal
## 6 31178 505646 GLUCOSE SERUM, FA… GLUF Normal
End Exercise
Importing dirty data
To close out the discussion on reading files, there is one more useful package to introduce that helps with a variety of data cleaning functions. Since this is R, the package is cleverly and appropriately named janitor. The quick take-home in terms of useful functions from this package:
- clean_names() will reformat column names to conform to the tidyverse style guide: spaces are replaced with underscores & uppercase letters are converted to lowercase
- empty rows and columns are removed with remove_empty_rows() or remove_empty_columns()
- tabyl(variable) will tabulate into a data frame based on 1-3 variables supplied to it
Let’s take these functions for a spin using our data set. We are going to use the development version of the package because there is new, additional functionality. I will chain the commands together with pipes (which we’ll discuss in more detail in the next lesson).
First let’s review the first few lines of data after cleaning the columns names:
# install.packages("janitor", dependencies = TRUE) # uncomment to install if needed
# the development version of janitor handles cleaning names better than the current CRAN version
library(janitor)
readxl_load <- read_excel("data/orders_data_set.xlsx")
readxl_load_cleaned <- readxl_load %>%
clean_names()
head(readxl_load_cleaned)
## # A tibble: 6 x 15
## order_id patient_id description proc_code order_class_c_d… lab_status_c
## <dbl> <dbl> <chr> <chr> <chr> <dbl>
## 1 19766 511388 PROTHROMBI… PRO Normal NA
## 2 88444 511388 BASIC META… BMP Normal NA
## 3 40477 508061 THYROID ST… TSH Normal 3
## 4 97641 508061 T4, FREE T4FR Normal 3
## 5 99868 505646 COMPREHENS… COMP Normal 3
## 6 31178 505646 GLUCOSE SE… GLUF Normal 3
## # ... with 9 more variables: lab_status_c_descr <chr>,
## # order_status_c <dbl>, order_status_c_descr <chr>,
## # reason_for_canc_c <dbl>, reason_for_canc_c_descr <chr>,
## # order_time <dttm>, result_time <dttm>, review_time <dttm>,
## # department <chr>
Now we’ll do a quick tabulation to count the different order classes in this orders data set:
readxl_load_cleaned %>% tabyl(order_class_c_descr)
## order_class_c_descr n percent
## Clinic Collect 6427 0.1428158749
## External 401 0.0089107151
## Historical 5 0.0001111062
## Normal 36326 0.8072085685
## On Site 1843 0.0409537354
Importing multiple files at once
One of the most compelling reasons to learn how to program is to expand your ability to automate or effortlessly repeat common actions and workflows. In most research and clinical lab environments, the data that people deal with day-to-day is not neatly stored in an easy-to-use database. It is often spread out over a series of messy spreadsheets that might be associated with one batch of data, one day of data, one week of data, or some variant. While the best practice for that scenario is probably to build a database to store the data, that requires a good amount of overhead and some expertise. By taking advantage of iteration in R, you can pull similarly formatted files into data frames (tibbles).
The purrr package has a variety of map() functions that are well-explained in the iteration chapter of R for Data Science. The map() functions take a vector as input, apply a function to each element of the vector, and return a vector of the same length as the input. There are a number of map functions that correspond to the data type of the output. For example, map() returns a list, map_int() returns a vector of integers, map_chr() returns a character vector, and map_dfr() returns a data frame. These are very similar to the apply() family of functions, but the purrr functions have some advantages, including consistent compatibility with pipes and more predictable output data types.
How does this work? Let’s take a simple example right out of the R for Data Science text. We’ll start with a tibble (tidyverse version of data frame) consisting of 4 variables (a through d) with 10 observations from a normal distribution.
df <- tibble(
a = rnorm(10),
b = rnorm(10),
c = rnorm(10),
d = rnorm(10)
)
df
## # A tibble: 10 x 4
## a b c d
## <dbl> <dbl> <dbl> <dbl>
## 1 1.05 -1.09 0.0428 1.08
## 2 0.424 0.165 -0.570 -1.83
## 3 0.323 1.80 0.307 -0.139
## 4 0.306 0.704 2.14 0.294
## 5 -0.352 -0.984 -0.105 1.30
## 6 0.648 -1.27 0.0411 0.0215
## 7 -0.268 -0.536 0.340 1.07
## 8 0.343 -2.17 -0.478 -0.194
## 9 -0.473 -0.828 -1.00 0.637
## 10 -1.53 -0.374 0.245 0.430
We want to treat each variable as a vector and perform a calculation on each. If we want to take the mean of each and want the output to have a double data type, we use map_dbl():
df %>%
map_dbl(mean)
## a b c d
## 0.04769237 -0.45835602 0.09574256 0.26677594
That is a pretty simple example, but it captures the types of operations you can do by iterating through a data set. For those of you who are familiar with for loops, the map functions offer similar functionality but are much shorter to write and more straightforward to understand.
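For comparison, the same column means computed with a for loop, using the df tibble from above - more code for the same result:

```r
means <- vector("double", ncol(df))  # pre-allocate the output vector
for (i in seq_along(df)) {
  means[i] <- mean(df[[i]])          # df[[i]] extracts column i as a plain vector
}
names(means) <- names(df)
means                                # same values as map_dbl(df, mean)
```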
Earlier in this lesson we discussed file reading functions, with the recognition that many data analysis tasks rely on flat files for source data. In a laboratory running batched testing such as a mass spectrometry lab, files are often tied to batches and/or dates and named correspondingly. If you want to analyze a set of data over multiple batches, you may find yourself importing data from each individually and stitching together the data using a function like bind_rows() (we will discuss this function in a future lesson). The map() functions (often map_dfr() specifically) can automate this process and save you a lot of time. There are a few prerequisites for this to work, though:
- the underlying file structure must be the same: for spreadsheet-like data, columns must be in the same positions in each file with consistent data types
- the files must have the same file extension
- if there are multiple different file types (with different data structures) mixed in one directory, the files must be organized and named in a way that associates like data sets with like
In the last lesson we placed our large mass spec data set in the data folder. This consists of a series of monthly data that are grouped into batches, samples, and peaks data, with suffixes of "_b", "_s", and "_p", respectively. Let's read all of the sample data into one data frame (technically a tibble). We are going to use the read_csv() function since the files are csvs. To use the map_dfr() function, we need to supply a vector as input - in this case, a vector of file names. How do we generate that input vector?
- First we use list.files(), which produces a character vector of names of files in a directory, which is the first argument. The function allows a pattern argument which you can supply with a text string for it to match against - all of the sample files end in "_s.csv".
- Next we pipe that list to file.path(), which provides an operating-system-agnostic way of producing a character vector of the appropriate file names and paths. We started with the names of the files we care about, but we need to prepend the "data" folder to the names. You'll notice that we used a period as the second argument - this is because by default the pipe feeds the output of the previous step into the first argument. The period is a placeholder to indicate that the output should be fed into a different argument.
- Finally we feed that character vector to map_dfr(), which takes the read_csv() function as its argument. With the map family of functions, there is no need to include the parentheses in the function name if there aren't any arguments.
all_samples <- list.files("data", pattern = "_s.csv") %>%
file.path("data", .) %>%
map_dfr(read_csv) %>%
clean_names()
## Parsed with column specification:
## cols(
## batchName = col_character(),
## sampleName = col_character(),
## compoundName = col_character(),
## ionRatio = col_double(),
## response = col_double(),
## concentration = col_double(),
## sampleType = col_character(),
## expectedConcentration = col_integer(),
## usedForCurve = col_logical(),
## samplePassed = col_logical()
## )
## (the same column specification is printed for each of the remaining sample files)
summary(all_samples)
## batch_name sample_name compound_name ion_ratio
## Length:2244840 Length:2244840 Length:2244840 Min. :0.0000
## Class :character Class :character Class :character 1st Qu.:0.0000
## Mode :character Mode :character Mode :character Median :0.8165
## Mean :0.6564
## 3rd Qu.:1.2452
## Max. :2.4332
## response concentration sample_type
## Min. :0.0000 Min. : 0.00 Length:2244840
## 1st Qu.:0.0000 1st Qu.: 0.00 Class :character
## Median :0.2982 Median : 42.55 Mode :character
## Mean :0.9658 Mean :134.46
## 3rd Qu.:1.8593 3rd Qu.:261.81
## Max. :9.2258 Max. :860.59
## expected_concentration used_for_curve sample_passed
## Min. : 0.00 Mode :logical Mode :logical
## 1st Qu.: 0.00 FALSE:1956363 FALSE:57190
## Median : 0.00 TRUE :288477 TRUE :2187650
## Mean : 35.77
## 3rd Qu.: 0.00
## Max. :500.00
If you weren't already aware of this solution (or another) for reading in multiple files at once, the purrr package is an extremely handy tool for the job. Just be aware of the requirements for doing this, and always check the output. You do not want to automate a bad or broken process!
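One simple way to check before you bind, sketched here using the same "data" folder and "_s.csv" files as above, is to read the files into a list first and confirm the column names agree:

```r
library(readr)
library(purrr)
library(dplyr)  # for bind_rows()

files <- file.path("data", list.files("data", pattern = "_s.csv"))
frames <- map(files, read_csv)  # a list with one data frame per file

# confirm every file parsed to the same column names before combining
stopifnot(all(map_lgl(frames, ~ identical(names(.x), names(frames[[1]])))))
all_samples <- bind_rows(frames)
```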